1 Learning objectives

  1. You will use the mutate() function from the {dplyr} package to create new variables or modify existing variables.

  2. You will create new numeric, character, factor, boolean, and date variables.

  3. You will use the case_when() function from the {dplyr} package to create new variables based on conditions.

  4. You will know an additional handy argument of the mutate() function: .keep.

  5. You will mutate across multiple columns using the across() function from the {dplyr} package.


2 The Yaounde COVID-19 dataset

In this lesson, we will again use the data from the COVID-19 serological survey conducted in Yaounde, Cameroon.

library(tidyverse)  # loads readr (read_csv), dplyr (select, mutate) and the %>% pipe
yaounde <- read_csv(here::here('ch04_data_wrangling/data/yaounde_data.csv'))
## a smaller subset of variables
yao <- yaounde %>% select(date_surveyed, 
                          age, 
                          weight_kg, height_cm, 
                          symptoms, is_smoker)
yao

See the previous lesson for more information about this data.

3 Introducing mutate()

_Fig: the `mutate()` function. (Drawing adapted from Allison Horst)_

We use dplyr::mutate() to create new variables or modify existing variables. The general structure is: df %>% mutate(new_column_name = what_it_contains)

3.1 Changing a variable

dplyr::mutate() is the go-to tool for changing a variable. Here we convert height from centimeters to meters. We create a new variable whose name reflects the conversion, but we could just as well have changed the variable in place, with: mutate(height_cm = height_cm/100).

yao %>%
  mutate(height_m = height_cm/100)
# let's save this variable for later:
yao <- yao %>%
  mutate(height_m = height_cm/100)

3.2 Creating a variable from scratch

dplyr::mutate() is also the tool at your disposal for creating an entirely new variable that is useful for your analysis, such as an index or a ranking. Here is an example of an index. The expression 1:n() uses the : operator to build the range from 1 to n(), where n() counts the number of entries (i.e. rows) in the yao dataset; wrapping it in seq() returns that same sequence as the index.

yao %>%
  mutate(record_number = seq(1:n()))
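As a small aside, dplyr also provides row_number(), which gives the same running index without building the sequence by hand. The sketch below uses a made-up three-row tibble (toy values, not the Yaounde data) to show that both approaches agree.

```r
library(dplyr)

# Toy data, invented for illustration (not the Yaounde dataset)
toy <- tibble(weight_kg = c(72, 58, 65))

indexed <- toy %>%
  mutate(record_number = seq(1:n()),        # approach used in this lesson
         record_number_alt = row_number())  # dplyr helper giving the same index

indexed
```

Either form is fine for a simple index; row_number() is simply a little shorter.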

If you would like to create a ranking, the function dplyr::min_rank() is at your disposal. The function dplyr::desc() lets you rank in descending order.

yao %>%
  mutate(rank = min_rank(desc(weight_kg)))
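To make the behaviour of min_rank() with ties more concrete, here is a minimal sketch on a toy tibble (the weights are invented for illustration). Tied values share the smallest rank, and the following rank is skipped.

```r
library(dplyr)

# Toy data with a tie (two participants at 85 kg); values are invented
toy <- tibble(weight_kg = c(60, 85, 85, 70))

ranked <- toy %>%
  mutate(rank_asc  = min_rank(weight_kg),        # lightest gets rank 1
         rank_desc = min_rank(desc(weight_kg)))  # heaviest gets rank 1

ranked
```

In this toy example the two 85 kg rows both get rank 1 in descending order, and the next row gets rank 3.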

Convert the weight in kilograms into grams. Call your new column weight_g and submit your dataframe with your new column.

3.3 Creating a variable from existing variables

3.3.1 A boolean variable, based on the condition on another variable

You can easily create a boolean variable to categorize part of your population. Here we create a boolean variable, child, which is TRUE if the subject is a child and FALSE if the subject is an adult.

yao %>%
  mutate(child = age <= 18)

Create a boolean variable for the symptoms variable: set a condition based on No symptoms. You should name the column symptoms_boolean. If someone had no symptoms, they should be set to 0, and if they had any symptom, they should be set to 1.

3.3.2 A new numeric variable, combination of other variables

A common health indicator is the body mass index (BMI), which helps categorize a person’s overall health by indicating whether they are underweight, overweight, or neither.

\[ BMI = \frac{weight (kilograms)}{height (meters)^2} \]

yao %>%
  mutate(BMI = (weight_kg / (height_m)^2))
# Let's keep this variable for later!
yao <-
  yao %>%
  mutate(BMI = (weight_kg / (height_m)^2))

You can also add columns together, like so: mutate(z = x + y), or apply a logarithmic transform to a variable, like so: mutate(z = log(z)). In short, almost any column-wise computation can be done with dplyr::mutate()!
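Here is a minimal sketch of those two patterns on a toy tibble. The column names x, y, z and log_x are placeholders chosen for illustration, and log10() is used instead of log() only so that the results are easy to read.

```r
library(dplyr)

# Toy data, invented for illustration
toy <- tibble(x = c(1, 10, 100), y = c(2, 3, 4))

toy_new <- toy %>%
  mutate(z = x + y,         # element-wise sum of two columns
         log_x = log10(x))  # base-10 logarithm; log() would give the natural log

toy_new
```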

A handy argument of the mutate() function is .keep. As its name indicates, .keep lets you decide which variables to keep or drop after dplyr::mutate() runs. By default (.keep = "all"), every column is kept.

  • If you want to keep only the variables involved in mutate(), set the .keep argument to "used": this keeps the new variables and the variables that were used to create them.

  • If you want to drop the variables used in mutate(), for example the height and weight used to calculate the BMI, set the .keep argument to "unused" (keep the new variables and all the variables that were not used within dplyr::mutate()).

  • If you want to keep only the new variable created or the variable you changed using dplyr::mutate(), set the .keep argument to "none" (keep only the result of dplyr::mutate()).

Let’s try an example, setting .keep to unused i.e. dropping the height and weight variables after creating the BMI variable.

yao %>%
  mutate(BMI = (weight_kg / (height_m)^2), .keep = "unused")
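To compare the options side by side, the sketch below runs the same BMI calculation with each value of .keep on a toy tibble and prints the resulting column names. The helper bmi_with() and the id column are hypothetical, introduced only for this illustration.

```r
library(dplyr)

# Toy data, invented for illustration; `id` stands in for any unrelated column
toy <- tibble(id = 1:2, weight_kg = c(70, 55), height_m = c(1.75, 1.60))

# Hypothetical helper: compute BMI with a given .keep value, return column names
bmi_with <- function(keep) {
  toy %>%
    mutate(BMI = weight_kg / height_m^2, .keep = keep) %>%
    names()
}

bmi_with("all")     # keeps id, weight_kg, height_m and BMI
bmi_with("used")    # keeps weight_kg, height_m and BMI
bmi_with("unused")  # keeps id and BMI
bmi_with("none")    # keeps BMI only
```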

4 Mutating with a condition: dplyr::case_when()

_Fig: the `case_when()` function. (Drawing adapted from Allison Horst)_

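A healthy BMI is defined as lying between 18.5 and 25; a person in this range has a normal weight. dplyr::case_when() lets us assign a category to each BMI range: each line pairs a condition with the value to return, written as condition ~ value.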

# Let's keep this variable for the next part!

yao <-
  yao %>%
  mutate(BMI_classification = case_when(BMI < 18.5 ~ 'Too thin',
                                        BMI >= 18.5 & BMI <= 25 ~ 'Normal weight',
                                        BMI > 25 & BMI <= 30 ~ 'Overweight',
                                        BMI > 30 ~ 'Obese'))
yao
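Two details of case_when() are worth seeing in action: conditions are checked in order, so each row takes the value of the first condition that is TRUE, and rows matching no condition (for example a missing BMI) get NA. The sketch below, on invented BMI values, uses that ordering to shorten the conditions and adds a TRUE ~ fallback. The "Unknown" label for missing values is an assumption made for this illustration, not part of the classification above.

```r
library(dplyr)

# Toy BMI values, invented for illustration (including one missing value)
toy <- tibble(BMI = c(17.0, 22.4, 25.0, 27.9, 33.1, NA))

classified <- toy %>%
  mutate(BMI_classification = case_when(
    is.na(BMI) ~ "Unknown",        # assumption: label missing values explicitly
    BMI < 18.5 ~ "Too thin",
    BMI <= 25  ~ "Normal weight",  # only reached when BMI >= 18.5
    BMI <= 30  ~ "Overweight",     # only reached when BMI > 25
    TRUE       ~ "Obese"           # fallback for every remaining row
  ))

classified
```

Writing out both bounds, as in the lesson code, is more explicit; relying on the order is shorter. Both give the same categories for non-missing values.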

Create a variable called covid19_risk encompassing the risk factors of smoking and age for COVID-19. Define the profiles (these are approximations based on the overall medical consensus) as follows, using case_when():

  • High risk : a smoker, aged above 70

  • Moderate risk: an ex-smoker, aged above 70 OR a smoker, aged between 60 and 70

  • Low risk : an ex-smoker, aged between 60 and 70 OR a smoker, aged between 50 and 60

5 Changing a variable’s type

Often in a data analysis, depending on how you read in the data, you may need to do some data processing where you redefine the type of your variable. (Quick example: you may have a number that is written as a string when you want to handle it like a double or an integer.)

The main functions to change a variable type are as.character(), as.factor(), as.integer(), as.double(), and as.Date().

5.1 Factors: as.factor

For easier manipulation of your new variable BMI_classification, it can be helpful to transform it into a factor. You can do so with as.factor().

yao %>%
  mutate(BMI_classification = as.factor(BMI_classification))
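One thing to be aware of: when as.factor() converts a character vector, it orders the levels alphabetically, which is rarely the order you want for categories like these. The minimal sketch below shows this on a toy vector and, as one possible alternative, sets the order explicitly with base R's factor() and its levels argument.

```r
# Toy vector of categories, invented for illustration
classes <- c("Normal weight", "Obese", "Too thin", "Overweight")

# as.factor() sorts the levels alphabetically
alphabetical <- as.factor(classes)
levels(alphabetical)

# factor() with an explicit `levels` argument lets you choose the order
custom <- factor(classes,
                 levels = c("Too thin", "Normal weight", "Overweight", "Obese"))
levels(custom)
```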

If you want to reorder the levels of your factor variable (for a plot, for example), you can use fct_relevel() from the {forcats} package.

yao %>%
  mutate(BMI_classification = fct_relevel(BMI_classification, 
                                          "Obese", "Overweight", 
                                          "Normal weight", "Too thin"))

5.2 Character: as.character

Type conversions are flexible and easy. If you would like to convert BMI_classification back to a string (for example, for writing text on a plot), you can simply pass it through as.character().

yao %>%
  mutate(BMI_classification = as.character(BMI_classification))

5.3 Integer: as.integer

This flexibility also extends to numeric types. You can easily convert a double to an integer and an integer back to a double. Note that as.integer() drops the decimal part rather than rounding.

yao %>%
  select(BMI) %>% 
  mutate(BMI_int = as.integer(BMI),
         BMI_dbl = as.double(BMI_int))

If you want rounded numbers, you can use the round() function, like: mutate(BMI_round = round(BMI))
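To see the difference between as.integer() and round(), here is a short sketch on a few invented BMI values: as.integer() simply cuts off the decimals, while round() goes to the nearest value at the requested precision.

```r
# Toy BMI values, invented for illustration
bmi <- c(18.24, 24.96, 30.71)

bmi_int   <- as.integer(bmi)        # truncates: 18, 24, 30
bmi_round <- round(bmi)             # nearest whole number: 18, 25, 31
bmi_1dp   <- round(bmi, digits = 1) # one decimal place

bmi_int
bmi_round
bmi_1dp
```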

If you apply the as.integer() function to a factor variable, each value is replaced by the integer code of its level (1 for the first level, 2 for the second, and so on), not by a binary coding.
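A minimal sketch of what as.integer() returns for a factor, using invented values. It also shows a common pitfall: when a factor stores numbers as labels, you need to go through as.character() first to recover the numbers themselves.

```r
# Toy factor with an explicit level order, invented for illustration
f <- factor(c("Too thin", "Obese", "Too thin", "Normal weight"),
            levels = c("Too thin", "Normal weight", "Obese"))

codes <- as.integer(f)  # position of each value's level: 1, 3, 1, 2
codes

# A factor whose labels look like numbers
numeric_like <- factor(c("10", "20", "10"))

level_codes <- as.integer(numeric_like)                # level codes: 1, 2, 1
real_values <- as.integer(as.character(numeric_like))  # the numbers: 10, 20, 10
```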

5.4 Dates: as.Date

A final function worth mentioning is as.Date(). It takes a string in the format YYYY-MM-DD (Y: year, M: month, D: day) and converts it to a Date variable. Date objects have numerous advantages, such as being comparable with all the common operators (<, >, ==, etc.).

yao %>%
  mutate(date_surveyed = as.Date(date_surveyed))
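As a quick illustration of what a Date object allows, the sketch below compares and subtracts two dates. The dates are invented for this example. The second one is written in a day/month/year layout to show the format argument of as.Date(), which is needed whenever the string is not in YYYY-MM-DD form.

```r
# Invented dates, for illustration only
d1 <- as.Date("2020-10-22")                       # default YYYY-MM-DD format
d2 <- as.Date("15/11/2020", format = "%d/%m/%Y")  # day/month/year needs `format`

is_earlier <- d1 < d2               # dates can be compared directly
days_apart <- as.numeric(d2 - d1)   # difference in days

is_earlier
days_apart
```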

Transform the type of the is_smoker variable into a factor. Keep the same column name.

6 Performing a transformation on multiple variables: dplyr::across()

6.1 Applying a predefined R function “across” multiple columns

Imagine that you want to do some complex string operations on some of your variables, for further reports or figures. You might then find it useful to have all those numbers (weight, height, BMI) as characters instead of numbers. To leave out your date variable, date_surveyed, you would apply the as.character() transformation across all columns except date_surveyed, which is written !date_surveyed.

dplyr::across() is applied following this schema:

across(statement_defining_multiple_columns, function_to_apply_across_all_columns)

The statement defining multiple columns can be:

  • a list of names : c("height_m", "weight_kg", "age") OR c(height_m, weight_kg)

  • a condition or a negation: !date_surveyed OR where(is.numeric)

yao %>%
  mutate(across(!date_surveyed, as.character))
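The different ways of selecting columns can be sketched on a toy tibble as follows. The values are invented, and the three object names (by_name, by_type, by_negation) are placeholders for this illustration.

```r
library(dplyr)

# Toy data, invented for illustration
toy <- tibble(date_surveyed = as.Date(c("2020-10-22", "2020-10-24")),
              age = c(45, 20),
              weight_kg = c(95, 58),
              is_smoker = c("Non-smoker", "Ex-smoker"))

# Select by name
by_name <- toy %>% mutate(across(c(age, weight_kg), as.character))

# Select by type: every numeric column (Date columns do not count as numeric)
by_type <- toy %>% mutate(across(where(is.numeric), as.character))

# Select by negation: every column except date_surveyed
by_negation <- toy %>% mutate(across(!date_surveyed, as.character))

by_negation
```

In this toy example all three calls convert age and weight_kg to character and leave date_surveyed as a Date.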

6.2 Applying a custom function “across” multiple columns

There are many predefined R functions that you can use in dplyr::across(), but once in a while you need to write your own function.

Imagine you want to normalize the heights and weights of the different participants to use this data for further statistical analysis.

Suppose you want the values of the distribution X to fall in the 0-1 range: you can write your own min-max normalization function, applied to each element x of the distribution X.

\[ x_{normalized} = \frac{x - min(X)}{max(X) - min(X)} \]

The tilde ~ introduces your function. The .x stands for the column currently being processed: across() applies the function to each of the selected columns, one by one.

yao %>%
  mutate(across(c("height_m", "weight_kg"), 
                ~ (.x - min(.x)) / (max(.x) - min(.x)) , 
                .names = "normalized_{.col}"),
         .keep="unused") 
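If the ~ shorthand feels cryptic, the same normalization can be written as an ordinary named function and passed to across(). The sketch below does this on a toy tibble. The function name min_max() is hypothetical, and the na.rm = TRUE arguments are an addition not present in the code above, included here as an assumption so that a missing value does not turn the whole column into NA.

```r
library(dplyr)

# Hypothetical helper: min-max normalization, ignoring missing values
min_max <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}

# Toy data, invented for illustration (one missing weight)
toy <- tibble(height_m = c(1.50, 1.75, 2.00),
              weight_kg = c(50, NA, 90))

normalized <- toy %>%
  mutate(across(c(height_m, weight_kg), min_max,
                .names = "normalized_{.col}"))

normalized
```

A named function is easier to reuse and to test on its own, while the ~ form keeps short one-off transformations inline.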

Now let’s normalize the height (in meters), the weight and the age using a mean-standard deviation normalization.

Set the argument .keep to unused and name the new columns using the .names argument as above (.names = "mean_std_normalization_{.col}").

The formula normalizes each element x of distribution X using the mean and standard deviation of X, as follows:

\[ x_{normalized} = \frac{x-mean(X)}{std(X)} \]

6.3 Mixing custom and predefined functions to transform “across” multiple columns

A heads-up: this part is more advanced. Look into it and take inspiration from it, but it’s alright if it appears too complex. This code may also be useful for your projects, so feel free to copy and paste it.

You can also mix and match custom and predefined functions. One example is using dplyr::case_when() across multiple columns.

Imagine that you want to replace the NA values in your categorical variables (selected with where(is.character)) with Unknown. For all non-NA entries, you want to keep the existing value (referenced by .x, as explained above).

yao %>% 
  mutate(
    across(where(is.character), 
           ~ case_when(is.na(.x) ~ "Unknown", 
                       !is.na(.x) ~ .x), 
           .names = "unk_{.col}"),
    .keep = "used") %>% 
  count(is_smoker, unk_is_smoker)
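For this particular task, a shorter alternative is dplyr::coalesce(), which returns the first non-missing value among its arguments. The sketch below is one possible variant on invented data, not a replacement for the case_when() pattern, which remains more general when you have several conditions.

```r
library(dplyr)

# Toy data, invented for illustration (one missing smoking status)
toy <- tibble(is_smoker = c("Smoker", NA, "Non-smoker"),
              age = c(30, 41, 52))

toy_unk <- toy %>%
  mutate(across(where(is.character),
                ~ coalesce(.x, "Unknown"),  # keep .x unless it is NA
                .names = "unk_{.col}"))

toy_unk
```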
